Drafting Visualizations

Author

Haylee Oyler

1. Which option do you plan to pursue?

I plan to pursue option 1

2. Restate your questions. Has this changed at all since HW #1? If yes, how so?

Is there a gender gap in academic publishing and if so, what does it look like?

  • What is the gender gap across different academic disciplines?
  • How has the gender gap changed over time?
  • What does the gender gap look like from country to country?

I think these questions have changed slightly. I was originally interested in understanding how the gender publishing gap affects women’s career mobility and opportunities overall. That, however, is a much more complicated question that I do not have the data for. So, I’ve decided to go with a simpler approach.

3. Explain which variables from your data set you will use to answer your questions, and how.

The variables I will use to answer this question are as follows:

  • authors: int; represents the total number of authored publications for a specific country, gender, time period, and field.
  • country: chr; the country of origin of the publications
  • gender: chr; the gender of the publisher
  • subject_area_or_subfield: chr; the academic discipline the paper was published from
  • period: fct; represents one of two time periods (1993-2003) or (2014-2018). There is no information on publications between or after these time periods.
  • geometry: vec; country boundaries so I can map gender differences across countries

All this information comes from one dataset about gender and academic publishing created by Elsevier with the exception of the geometries, that come from a world administrative boundaries shape file.

Most of the variables I can use as-is in my visualizations to answer my questions. There is some wrangling that has to happen, and it mostly involves generating summary statistics of publications by gender and country, or gender and academic discipline.

4. Find at least two data visualizations that you could borrow/adapt pieces from and explain which elements you might borrow.

A visualization showing major industries colored by the gender proportion that work in that field. The words serve as the data themselves, with the letter coloring changing with the proportion.

Source: Georgios Karamanis

I like this data viz as a cool way I might show the gender gap amongst different academic disciplines. Currently, I’m using a bubble chart with the percentage of total publications by gender. I like this viz a lot because it really centers the industries themselves, rather than the numbers. I think this would work nicely with the data I have, however, I am also intimidated about trying to accomplish this.

A dumbbell plot of the disability prevalence in men, women, and intersex people in Kenya. Intersex people have higher rates of disability across all types.

Source: Georgios Karamanis

If I wanted to switch up my variables and display gender gap by country with a dumbbell chart, it could look something like this. I like how the values along the x-axis are also located inside the bubbles themselves, that way you don’t have to work to hard to see what they are. I also like the gender color choices here.

5. Hand-draw your anticipated visualizations

Hand drawn mock up of final project visualization with a map at the top, a dumbbell plot in the bottom left, and a pie chart in the bottom right

Hand-drawn final visualization

6. Mock up all of your hand drawn visualizations using code

Load libraries and settings
library(tidyverse)
library(here)
library(janitor)
library(readxl)
library(patchwork)
library(showtext)
library(glue)
library(ggtext)
library(scales)
library(sf)
library(tmap)
library(grid)
library(gridtext)

# Enable showtext
showtext_auto()

# Ensure showtext is used
showtext_opts(dpi = 300)

font_add_google(name = "Lexend", family = "lexend")
# Read in data
author_stats = read_xlsx(here("data", "authors.xlsx"), sheet = 1) %>% 
  clean_names() %>% 
  mutate(gender = str_to_title(gender))

# Read in countries geometries
countries <- read_sf(here("data", "countries", "world-administrative-boundaries.shp")) %>% 
  select(iso3, name, geometry)
# Filter data into the two different time periods

# Older data
author_old <- author_stats %>% 
  filter(period == "1999-2003")

# Generate some summary stats
old_sum <- author_old %>% 
  group_by(gender) %>% 
  summarise(gender_authors = sum(authors)) %>% 
  ungroup() %>% 
  mutate(total_authors = sum(gender_authors), 
         percent_pub = gender_authors/total_authors)

# More recent data
author_new <- author_stats %>% 
  filter(period == "2014-2018")

# Generate some summary stats
new_sum <- author_new %>% 
  group_by(gender) %>% 
  summarise(gender_authors = sum(authors)) %>% 
  ungroup() %>% 
  mutate(total_authors = sum(gender_authors), 
         percent_pub = gender_authors/total_authors)

Pie chart of total publications by gender over time

Reveal code
# Pie chart for the older years
old_pie <- ggplot(old_sum, aes(x = "", y = percent_pub, fill = gender)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(gender, "\n", scales::percent(percent_pub, accuracy = 1)), family = "lexend"),             
            position = position_stack(vjust = 0.5),  
            color = "white", size = 5) +  
  scale_fill_manual(values = c("Women" = "#ec9bfc", "Men" = "#6A1E99")) + 
  theme_void() +  
  labs(title = "1993-2003",
       fill = "") +
  theme(
    text = element_text(family = "lexend"),
    legend.text = element_text(family = "lexend"),
    plot.title = element_text(hjust=0.5),
    legend.position = "none"
  )

# Pie chart for the more recent years
new_pie <- ggplot(new_sum, aes(x = "", y = percent_pub, fill = gender)) +
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(gender, "\n", scales::percent(percent_pub, accuracy = 1)), family = "lexend"),             
            position = position_stack(vjust = 0.5),  
            color = "white", size = 5) +  
  scale_fill_manual(values = c("Women" = "#ec9bfc", "Men" = "#6A1E99")) + 
  theme_void() +  
  labs(title = "2014-2018",
       fill = "") +
  theme(
    text = element_text(family = "lexend"),
    legend.text = element_text(family = "lexend"),
    plot.title = element_text(hjust=0.5),
    legend.position = "none"
  )

# Stick the two figures together
patchwork <- old_pie + new_pie
patch_final <- patchwork + plot_annotation(
  title = "Total Academic Publications by Gender",
  subtitle = "Women have increased their share of total academic publications by 10% from 1993 to 2018",
  caption = "Source: Elsevier Data Repository | Graphic: Haylee Oyler"
) &
  theme( 
    text = element_text(family = "lexend"),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 14),
    plot.background = element_rect(fill = "white", color = NA),
    legend.text = element_text(size=12, face = "bold"),
    plot.caption = element_text(size=10)
    )
patch_final

Two pie charts showing the changing in the gender breakdown of total academic publications. Women composed 25% of publications between 1993-2003 and they composed 35% of publications between 2014-2018

Save plot
# Save figure 
# ggsave(
#   filename = here::here("images", "pie.png"),
#   plot = patch_final, 
#   device = "png",
#   width = 9, 
#   height = 6,
#   unit = "in",
#   dpi = 300
# )
# Calculate summary stats grouped by field
fields <- author_stats %>% 
  group_by(subject_area_or_subfield, gender) %>% 
  summarise(field_gender = sum(authors)) %>% 
  mutate(total_authors = sum(field_gender),
         percent_field = field_gender/total_authors)


# pivot wider to add columns of the number of authors by field by men or women
fields_wide <- fields %>%
  pivot_wider(
    id_cols = c(subject_area_or_subfield, total_authors),
    names_from = gender,
    values_from = c(field_gender, percent_field),
    names_prefix = ""
  ) %>%
  group_by(subject_area_or_subfield) %>%
  mutate(total_authors = first(total_authors),
         gender_gap = field_gender_Men - field_gender_Women,
         percent_gender_gap = percent_field_Men - percent_field_Women) %>%
  ungroup() %>% 
  filter(subject_area_or_subfield != "ALL")

Dumbbell chart of total publications by gender across disciplines

Reveal code
# Reorder data by the gender gap from high to low
fields_wide <- fields_wide %>% 
  mutate(subject_area_or_subfield = fct_reorder(.f = subject_area_or_subfield, 
                                                .x = percent_gender_gap))

# dumbbell plot
ggplot(fields_wide) +
  geom_linerange(aes(y = subject_area_or_subfield,
                     xmin = percent_field_Women, 
                     xmax = percent_field_Men)) +
  geom_point(aes(x = percent_field_Women, 
                 y = subject_area_or_subfield, 
                 color = "Women"), size = 2.5) +
  geom_point(aes(x = percent_field_Men, 
                 y = subject_area_or_subfield, 
                 color = "Men"),
             size = 2.5) +
  geom_vline(xintercept = .5, linetype = "dashed", color = "gray40") +
  scale_x_continuous(breaks = seq(0, 1, by = 0.1),
                     labels = scales::percent_format(scale = 100)) +  
  scale_color_manual(values = c("Women" = "#ec9bfc", "Men" = "#6A1E99")) + 
  labs(title = "Men Publish More Across Most Academic Fields",
       subtitle = "Percentage of total academic publications by men and women",
       caption = "Source: Elsevier Data Repository | Graphic: Haylee Oyler",
      x = "Percentage of Total Publications",
      y = "",
      color = "Gender") +
  theme_minimal(base_size = 18)  + # was 20 for png
    theme(
    text = element_text(family = "lexend"),
    plot.title = element_text(face = "bold"),
    # plot.subtitle = ggtext::element_textbox(family = "sen",
    #                                         size = rel(1.1),
    #                                         color = "black",
    #                                         width = unit(35, 'cm'),
    #                                         padding = margin(t = 5, r = 0, b = 5, l = 0),                                     margin = margin(t = 2, r = 0, b = 6, l = 0)),
    axis.title = element_text(size=rel(1)),
    axis.text.x = element_text(size=rel(1), face = "bold"),
    axis.text.y = element_text(size=rel(0.8), face = "bold", color = "black"),
    plot.caption = element_text(size=rel(0.6)),
    plot.background = element_rect(fill = "white", color = NA))

A dumbbell chart showing the proportion of academic publishings by gender across different disciplines. Mathematics, engineering, and physics have the highest gender gap of approximately 60%, public health has almost no gender gap, and nursing and psychology display a gender gap toward majority women of 10-20%.

Save plot
# Save plot
# ggsave(
#   filename = here::here("images", "dumbbell.png"),
#   plot = dumbbell_plot, 
#   device = "png",
#   width = 11, 
#   height = 9,
#   unit = "in",
#   dpi = 300
# )

Map of publications by country

# Data wrangling on average number of publications by country
pub <- author_stats %>% 
  group_by(country) %>%
  summarize(total_pub = sum(average_publications),
         pub_men = sum(average_publications[gender=="Men"]),
          pub_women = sum(average_publications[gender == "Women"])) %>% 
  mutate(pub_diff = pub_men - pub_women) 

# Pivot longer for visualization purposes
pub_long <- pub %>% 
  pivot_longer(cols = c(pub_men, pub_women), names_to = "gender", names_prefix = "pub_", values_to = "author_by_gender")
# Join countries geometries
pub_long <- left_join(pub_long, countries, by = join_by(country== iso3))

# Reassert that it's a sf data frame
pub_long <- pub_long %>% 
  st_as_sf() %>% 
  drop_na() %>% 
  filter(name != "Azores Islands") # remove extra country 
Reveal code
# Map of average publications by country
map <- tm_shape(pub_long) +
  tm_fill("pub_diff",
          palette = c("#F2C5FD", "#EC9BFC", "#D67BE6", 
                      "#C05BD0", "#A93CBA", "#8C2AA7", "#6A1E99"),
          title = "Average number of\nadditional papers published\nby men compared to women") +
  tm_layout(main.title = "Difference in Average Publications Across Countries\nBetween Men and Women",
            main.title.size = 1.8, 
            main.title.position = "center",
            main.title.fontface = "bold",
            # subtitle = "Data on gender disparities in academic publications across countries",
            # subtitle.size = 1.2,
            # caption = "Source: Elsevier Data Repository | Visualization: Haylee Oyler",
            # caption.size = 0.9,
            bg.color = "#f9f5fc",
            inner.margins = c(0, 0, 0, 0),  # Adjust margins (top, right, bottom, left)
            legend.position = c(0, 0.08),
            legend.text.size = 1.1,
            legend.title.size = 1.1,
            fontfamily = "lexend",
            frame = FALSE)

# Try to add subtitle caption manually
tmap_arrange(map)

# Subtitle
# grid::grid.text("Data on gender disparities in academic publications across countries", 
#           x = 0.4, y = 0.90, gp = gpar(fontsize = 14, fontfamily = "lexend"))

# Caption
grid::grid.text("Source: Elsevier Data Repository | Graphic: Haylee Oyler", 
          x = 0.5, y = 0.05, gp = gpar(fontsize = 10, fontfamily = "lexend"))

A world map showing the difference in average number of publications by country between men and women. Belgium, Germany, and Denmark have some of the highest disparities while Argentina, Mexico, and Turkey have some of the lowest.

7. Answer the following questions:

  1. What challenges did you encounter or anticipate encountering as you continue to build / iterate on your visualizations in R? If you struggled with mocking up any of your three visualizations (from #6, above), describe those challenges here.

So the mapping did not go nearly as planned. Only having 26 countries makes it difficult to show more meaningful trends. Also, the “country” with the largest gender gap is this eu28 country, which consists of 28 european block countries. I decided to just drop it entirely because it didn’t make sense to visualize, especially because many countries inside eu28 also had their own entries.

  1. What ggplot extension tools / packages do you need to use to build your visualizations? Are there any that we haven’t covered in class that you’ll be learning how to use for your visualizations?

Currently, I’m using tidyverse, glue, showtext, ggtext, and scales. The only one we haven’t explicitly covered in this class is patchwork and grid.

  1. What feedback do you need from the instructional team and / or your peers to ensure that your intended message is clear?

I think feedback around the map would be nice. It was a bit of a last minute idea, and I’m not sure whether to stick with it or scrap it and go back to the original plan (which was a column chart). Also I really like the way my dumbbell plot looks right now, but I do think the idea of doing the words plot like in the example visualization I selected could be cool. Any thoughts about that would be appreciated. Maybe I could visualize something else with the dumbbells?